Implement OpenMP offload for do group update by edoyango · Pull Request #1782 · NOAA-GFDL/FMS

edoyango · 2025-10-10T04:58:11Z

Description
This adds the necessary code to make mpp_do_group_update work with arrays that are managed by NVIDIA's OpenMP offload runtime. It aims to be minimally disruptive: the relevant OpenMP directives are wrapped in macros, so non-NVIDIA compilers see the same behaviour as before.

Fixes #1771

How Has This Been Tested?
The OpenMP offload capability is currently tested on the "double gyre" case in MOM6-examples using the nvfortran compiler and a CUDA-aware Open MPI. We have some notes on how to run the GPU-enabled MOM6, but they are outdated.

Checklist:

My code follows the style guidelines of this project
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
Any dependent changes have been merged and published in downstream modules
New check tests, if applicable, are included
make distcheck passes

nikizadehgfdl · 2025-12-10T21:01:31Z

                 do i = is, ie
-                    pos = pos + 1
-                    field(i,j,k) = buffer(pos)
+                    idx = pos + (k-1)*nj*ni + (j-js)*ni + (i-is) + 1


How are these two implementations equivalent? Is new idx = old pos always?

edoyango · 2025-12-10

Yes, they're equivalent. For any iteration, idx = pos + (k-1)*nj*ni + (j-js)*ni + (i-is) + 1 produces the same value that pos would have had at that point. The formula counts all the iterations that would have occurred in the nested loops before reaching that (i,j,k) position.

The reason for the change is that each nested iteration is now independent and can be performed in parallel.
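The equivalence claimed above can be checked with a small sketch. It is written in Python rather than the PR's Fortran, purely as a quick check, and it assumes details the excerpt does not show: that k runs from 1 to ke, that ni = ie - is + 1 and nj = je - js + 1, and that i is the fastest-varying loop. The function names are hypothetical.

```python
def running_counter_positions(pos, i_s, i_e, j_s, j_e, k_e):
    """Old scheme: bump a shared counter once per element (i fastest)."""
    positions = {}
    for k in range(1, k_e + 1):          # assumption: k starts at 1
        for j in range(j_s, j_e + 1):
            for i in range(i_s, i_e + 1):
                pos += 1                 # loop-carried dependency
                positions[(i, j, k)] = pos
    return positions


def closed_form_position(pos, i, j, k, i_s, i_e, j_s, j_e):
    """New scheme: compute the buffer index from (i, j, k) alone."""
    ni = i_e - i_s + 1                   # assumption: extent of the i loop
    nj = j_e - j_s + 1                   # assumption: extent of the j loop
    return pos + (k - 1) * nj * ni + (j - j_s) * ni + (i - i_s) + 1


def schemes_agree(pos, i_s, i_e, j_s, j_e, k_e):
    """True if both schemes give the same index for every (i, j, k)."""
    expected = running_counter_positions(pos, i_s, i_e, j_s, j_e, k_e)
    return all(
        closed_form_position(pos, i, j, k, i_s, i_e, j_s, j_e) == p
        for (i, j, k), p in expected.items()
    )


# Arbitrary bounds, including a negative lower bound for j.
print(schemes_agree(100, 3, 7, -2, 4, 5))
```

Because the closed form depends only on the loop indices and the starting value of pos, no iteration has to wait for the previous one. In this form pos is not advanced inside the loop nest, so it presumably has to be advanced by ni*nj*ke afterwards; that step is not shown in the excerpt.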

marshallward · 2026-01-29T15:44:25Z

I've finally managed to put together a timing test for MOM6, and the timings for domain decomposition are promising. There is strong scaling for the larger domains as we saturate the GPU.

Scaling is shown for up to 8 GPUs over 4 nodes on Ursa, so this method should work on our NOAA resources.

Ursa does not have NVLink, so its communication time shouldn't be taken as reflective of production runs. But I was surprised that it was not too large, even after going through the PCIe and InfiniBand layers.

I will look into including mpp tests which exercise this new code.

edoyango · 2026-02-16T22:52:56Z

In 0cc2a77, compiler macros are replaced with the runtime OpenMP if clause to control whether CPU parallelism or GPU offload is used at runtime. However, removing the macros causes a significant slowdown in the packing and unpacking on GPU, even if the CPU parallelism is supposed to be disabled by the if clause.

For example, below are the halo communication times for a self-communication (re-entrant domain) in MOM6 barotropic, where packing and unpacking are on GPU:

                                      hits          tmin          tmax          tavg          tstd  tfrac grain pemin pemax
(Ocean BT pre-step halo updates)       768    109.800896    109.800896    109.800896      0.000000  0.370    41     0     0
(Ocean BT stepping halo updates)       960     78.534592     78.534592     78.534592      0.000000  0.265    41     0     0

And with the CPU OpenMP directives not compiled via macros:

                                      hits          tmin          tmax          tavg          tstd  tfrac grain pemin pemax
(Ocean BT pre-step halo updates)       768      0.431521      0.431521      0.431521      0.000000  0.009    41     0     0
(Ocean BT stepping halo updates)       960      0.211485      0.211485      0.211485      0.000000  0.004    41     0     0

Hence, 0cc2a77 will be reverted.
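To put a number on the slowdown, the ratio of the two tables can be worked out directly. This is only a sketch: the values are the tavg columns copied from the tables above, and it assumes the first table corresponds to the runtime if clause (0cc2a77) and the second to the macro-guarded build, which is how the comment reads. The variable and function names are hypothetical.

```python
# Average times copied from the two timing tables above.
with_if_clause = {"pre-step": 109.800896, "stepping": 78.534592}
with_macros = {"pre-step": 0.431521, "stepping": 0.211485}


def slowdown_factors(slow, fast):
    """Ratio of slow to fast time for each timer present in both runs."""
    return {name: slow[name] / fast[name] for name in slow if name in fast}


factors = slowdown_factors(with_if_clause, with_macros)
for name, factor in factors.items():
    print(f"BT {name} halo updates: about {factor:.0f}x slower")
```

Under that reading, the halo update timers are slower by a factor of roughly 250 to 370 when both sets of directives are compiled, which is consistent with the decision to revert.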

bensonr

For some of the target omp statements, the if (use_device_ptr) is prior to the target data mapping and other times after. Does it matter where in the !$omp specification the if-test occurs?

The preference for !$omp is default(none) with full specification of shared and private.

marshallward · 2026-03-20T14:13:48Z

> For some of the target omp statements, the if (use_device_ptr) is prior to the target data mapping and other times after. Does it matter where in the !$omp specification the if-test occurs?

AFAIK clause order is not important. I tend to prefer ending with if (similar to the one-line Python construct) but we can switch to your preference.

(Also note that some of the if (use_device_ptr) statements are actual Fortran if-blocks, rather than OMP directive clauses.)

> The preference for !$omp is default(none) with full specification of shared and private.

👍

rem1776 · 2026-06-18T14:10:24Z

@edoyango Could you update this with main when you get a chance? Looks like it's showing some conflicts that we should probably clear up before we get this reviewed.

edoyango · 2026-06-24T01:41:27Z

Hi @rem1776, I've tried rebasing on top of main locally, but nvfortran still can't build main because of the use of select rank in mpp/include/mpp_pack.fh

rem1776 · 2026-06-24T15:08:21Z

Sorry about that! Thought the fix (#1873) was merged but looks like it's still waiting for a review. It should get merged sometime this week, I'll let you know.

Some very minor changes to the OpenMP target MPI PR:
* use_device_ptr -> use_device_addr. This appears to be the updated form (or at least nvfortran says it is).
* Whitespace added to `!$ use omp_lib`. Does not seem crucial, but from our previous discussion it appears more correct.
* Removal of some trailing whitespace.

This ensures that the (un)packing steps in do_group_update are performed with OpenMP CPU parallelism when ompoffload is .false. (previously they ran serially). This is implemented by undefining the GPU macro (currently __NVCOMPILER_OPENMP_GPU) and re-including the (un)packing files. To make this work, default(shared) was used in all the relevant OpenMP directives. If default(none) is used, the loops hang or segfault.

edoyango · 2026-06-26T06:53:37Z

@rem1776 OK, I've rebased on top of main as well as your select rank workaround, and kept my tests.

I did notice that main + your workaround produces garbled terminal output with nvhpc though...

e.g. this is MOM6 standalone output:

P-�p@Xpr��������-�������@����+��:��;�pkhprL|�'�mgpr�:gp�gprP�\�Ɉhpr�gprpkhpr�\��\�gpr��\�Ɉhpr�
%���\���\��\�
             �ݞ@�gprL|�'�gpr(�\T�\�ohpr�gprL|�'p�\�P�\���hprpkhpr��\���hprohpr��\�pr
pkhpr�\��\�gpr��\�Ɉhpr�
%���\���\��\�
             �ݞ@�gprL|�'�gpr(�\T�\�ohpr�gprL|�'p�\�P�\���hprpkhpr��\���hprohpr��\�prNOTE: MPP_DOMAINS_SET_STACK_SIZE: stack size set to   955296.
S_SET_STACK_SIZE: stack size set to   955296.

Update test_mpp_domains to pass the OpenMP offload flag through the group update path and stage test data on the device for offloaded runs. Add explicit OpenMP maps and shared clauses for the offloaded setup and reference-copy loops, including shifted CGRID and BGRID extents. Unskip the group update OpenMP offload test in test_mpp_domains2.sh.

Commit 22dfe08 redirected the mpp_error_basic label into a new text_errortype variable but only copied it into the printed `text` inside the npes>1 branch. On single-PE runs `text` was left uninitialized, so mpp_error emitted garbled stack memory. Seed `text` from text_errortype on the npes==1 path.

edoyango · 2026-07-01T02:47:19Z

The garbled output reported above turned out to be caused by an uninitialized msg variable being printed when pe == 1. Adding an else branch to handle the pe == 1 case solves it. It's a small fix, so I've just added it to this PR.

rem1776 · 2026-07-01T20:27:43Z

@edoyango Thanks for putting that mpp_error fix in! I think I was seeing something similar with gcc, but I like your fix better. I'm going to mark this PR as ready for review so we can try to include it in our next tag.

edoyango force-pushed the ompoffload branch from 3346bf4 to 3e3da6e on October 10, 2025 05:32

nikizadehgfdl reviewed Dec 10, 2025

bensonr reviewed Mar 19, 2026

Comment thread mpp/include/group_update_pack.inc

bensonr reviewed Apr 16, 2026

Comment thread mpp/include/mpp_transmit.inc

rem1776 and others added 20 commits June 26, 2026 16:41

add select-rank-less versions of mpp_pack and mpp_global_field

68f95a0

quick fix for class(*) in diag_data

e0be911

remove whitespace

f247654

fix indent in diag_data

f41c8ac

add gpu2gpu mpi transer with flag for do_group_update

e523d22

add missing collapse(3) clauses

c2a0391

Use __NVCOMPILER macro for target regions

e0172fd

add back old omp directive wrapped in #ifndef __NVCOMPILER

3947369

port remaining un/pack loops

d622346

add multi gpu support (#2)

57b1989

* add multi gpu support
* address review comments, add helpful comment for the acc/mp runtime call

sub __NVCOMPILER with __NVCOMPILER_OPENMP_GPU

715763e

allow choice of gpu or cpu parallel

72fbc7b

To enable this, had to be removed - otherwise segfaults happen on the GPU.

fix omp set device call

8b84417

Revert "allow choice of gpu or cpu parallel"

fc36b70

This reverts commit 0cc2a77. Having both the CPU and GPU OpenMP directives compiled caused a significant slowdown in GPU packing/unpacking performance - even if parallelism is controlled using OpenMP "if" clause.

OMP target MPI: line length compliance

af47ce2

This patch refactors several lines to keep within the 121-character line length limit prescribed by the FMS style guidelines.

OMP MPI: Update nocomm interface

dab46c3

The no-comm (no MPI) interface has been updated to support the new omp_offload argument.

remove trailing whitespace in comments

9b45a30

add changes for group update test

f883340

edoyango added 2 commits June 26, 2026 16:41

fix k outside path and normalise omp offload clauses

93e03ac

offload folded-north post-processing

11f93e7

Didn't use the if(use_device_ptr) omp clause because the tests would hang. Instead made explicit branches. Left out mappings because gcc didn't like Cray pointers in maps.

edoyango force-pushed the ompoffload branch from 82fe869 to 7f0f3c2 on June 26, 2026 06:50

edoyango force-pushed the ompoffload branch from 7f0f3c2 to 7b28243 on June 29, 2026 22:22

rem1776 approved these changes Jul 1, 2026

rem1776 mentioned this pull request Jul 1, 2026

fix mpp_error output bug #1883

Closed

8 tasks

rem1776 marked this pull request as ready for review July 1, 2026 20:28

rem1776 requested review from uramirez8707 and vithikashah001 as code owners July 1, 2026 20:28

vithikashah001 approved these changes Jul 2, 2026

7 participants
